Programming Massively Parallel Processors: A Hands-on Approach: Breaking Through the Sequential Computing Bottleneck

The End of the "Free Lunch" Era

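For decades, developers enjoyed a free lunch: Dennard scaling ensured that each new chip generation brought higher clock frequencies, so sequential code ran faster without modification. That era ended when we hit the Power Wall, the thermal limit that stalled frequency scaling and imposed the "Sequential Ceiling". Performance no longer depends on clock frequency; it depends on concurrency. To keep moving forward, we must apply Computational Thinking to bridge the gap between abstract numerical methods and modern parallel execution models.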

The Precision-Performance Trade-off

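Porting a domain problem (such as molecular dynamics) from a multicore host to a CUDA device is not merely a change of syntax; it is a change in problem decomposition. Parallelizing a computation often changes the order in which its operations are performed. Because floating-point arithmetic is not associative, we face a trade-off involving floating-point precision and accuracy: the parallel result may be mathematically valid yet deviate numerically from the serial version.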
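
The effect of reordering can be reproduced on the host without any GPU. The following is a minimal sketch in plain Python, whose floats are IEEE 754 doubles; it only mimics the way a parallel reduction combines per-thread partial sums and is not CUDA code. The function names `sum_sequential` and `sum_chunked`, the chunk count, and the input values are all hypothetical choices made for illustration.

```python
import math


def sum_sequential(values):
    """Add the values left to right, as a single serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_chunked(values, num_chunks):
    """Mimic a parallel reduction: sum each chunk, then sum the partials.

    Each chunk stands in for the work of one thread (an assumption made
    for illustration; a real device reduction is organized differently).
    """
    chunk_size = (len(values) + num_chunks - 1) // num_chunks
    partials = [
        sum_sequential(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sum_sequential(partials)


# Illustrative input: large values that cancel, plus small values that are
# lost to rounding whenever they are added to a large intermediate sum.
data = [1e16, 1.0, -1e16, 1.0]

serial = sum_sequential(data)   # ((1e16 + 1.0) + -1e16) + 1.0
chunked = sum_chunked(data, 2)  # (1e16 + 1.0) + (-1e16 + 1.0)
exact = math.fsum(data)         # correctly rounded reference sum

print(serial)   # 1.0
print(chunked)  # 0.0
print(exact)    # 2.0
```

Both orderings evaluate the same mathematical expression, yet they round differently, and neither matches the correctly rounded reference. The input is deliberately extreme; on realistic data the discrepancy is usually confined to the last few digits, but it has the same origin.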

QUESTION 1

What is the primary reason the 'Sequential Ceiling' was reached?

The end of Moore's Law entirely.

Thermal limits and the Power Wall hindering frequency scaling.

Lack of developer interest in C++.

The transition to quantum computing.

QUESTION 2

According to Amdahl's Law, if 5% of a program is strictly sequential, what is the maximum theoretical speedup?

Infinite speedup.

Approximately 20x.

5x.

100x.

QUESTION 3

Why might a parallel Molecular Dynamics simulation yield slightly different results than a sequential one?

The CPU uses 64-bit while the GPU only uses 8-bit.

Floating-point addition is non-associative, and parallel execution changes the order of operations.

Parallel threads randomly skip calculations.

The CUDA compiler ignores numerical methods.

QUESTION 4

What does 'Problem Decomposition' involve in the context of parallel programming?

Breaking code into functions for readability.

Mapping domain-specific data to parallel execution models like threads or grids.

Deleting unnecessary variables to save memory.

Compiling the code for multiple OS targets.

QUESTION 5

Which of the following describes the 'Computational Thinking' bridge?

A hardware component between the CPU and GPU.

A framework to translate domain knowledge into architecture-aware algorithms.

An automated AI tool that writes CUDA kernels.

The process of upgrading RAM on a host machine.